ABSTRACT

This paper is an extension of the work of Shemer and Gupta [4] on
the  effect  of  multiple  independent  memory  modules  on  processor  efficiency  and 
throughput.   Their  conclusions,  although  valid,  were  unnecessarily  restricted. 
The  results  derived  herein  expand  them  to  include  instructions  of  differing 
execution  times,  instructions  which  do  not  require  memory  access,  single  as 
well  as  multiple  instruction  fetch  and  single  as  well  as  multiple  word  data 
fetches,  plus  the  effects  of  a  non-Poisson  high-rate  input-output  device  such 
as  a  swapping  disk.   An  example  of  the  application  of  the  results  to  the 
ILLIAC  IV  computer  system  is  given. 
INTRODUCTION

This paper is an extension of the work of Shemer and Gupta [4] on the
effect  of  multiple  independent  memory  modules  on  processor  efficiency  and 
throughput.  Their  conclusions,  although  valid,  were  unnecessarily  restricted. 
For example, their results were derived under the clearly unrealistic assumption
that the execution times of all processor instructions were equal. The
results  derived  herein  expand  Shemer  and  Gupta's  results  in  that  instructions 
of  differing  execution  times,  instructions  which  do  not  require  memory  access, 
single  as  well  as  multiple  instruction  fetch,  and  single  as  well  as  multiple 
word  data  fetches  are  permitted  and  the  effects  of  a  non-Poisson  high-rate 
input-output  device  such  as  a  swapping  disk  are  included. 
SYSTEM MODEL AND OPERATION


Consider  the  system  configuration  shown  in  Figure  1.   The  processor 
and input-output controller have independent data paths for communication with
each of the N (the notation of Shemer and Gupta [4] will be used insofar as
applicable)  independent  memory  modules  which  form  the  primary  memory  of  the 
system.   In  order  to  minimize  data  queue  lengths  in  the  input-output  controller 
(typically  only  a  one  word  queue  is  implemented)  and  to  prevent  obliteration 
of  waiting  data  by  other  closely  following  data,  memory  cycle  requests  from 
the  input-output  controller  are  given  non-preemptive  priority  over  those  from 
the  processor. 

M  contiguous  words  are  fetched  simultaneously  at  any  memory  access. 
For  M  >  1,  this  corresponds  to  the  concept  of  fetching  an  M-word  superword. 
Addresses are interlaced among the independent memory banks such that address
"a" is in memory bank ((a div M) mod N). The processor fetches M instructions
simultaneously,  and  M'  of  these  require  an  operand  from  memory.   Thus  each  M 
instructions  require  M'  +  1  memory  cycles.   It  is  assumed  that  all  memory 
cycles  require  time  C. 
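The interlacing rule can be illustrated with a short sketch. The function name `memory_bank` and the sample values M = 2, N = 4 are illustrative assumptions, not values taken from the paper.

```python
def memory_bank(address: int, words_per_access: int, num_banks: int) -> int:
    """Bank holding `address` when M-word superwords are interlaced over N banks."""
    # (a div M) selects the superword, mod N selects the bank.
    return (address // words_per_access) % num_banks


# With M = 2 and N = 4, consecutive superwords rotate through the banks.
banks = [memory_bank(a, 2, 4) for a in range(10)]
print(banks)
```

Both words of a superword fall in the same bank, so one memory cycle delivers all M words.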

Let the instruction set of the processor consist of a repertoire of
R instructions, $g_1, g_2, \ldots, g_R$, of length $r_1, r_2, \ldots, r_R$
respectively. We add to this instruction set the instruction fetch operation
$g_0$, which requires time $r_0$ between initiation and execution of the first
fetched instruction.

Some of the R instructions require memory accesses during their
execution, others do not. Without loss of generality, let us order the elements
of the instruction set such that the first Q-1 instructions require a
memory cycle. We then define

$$G' = \{ g_i \mid 0 \le i \le Q-1 \}$$
$$G'' = \{ g_i \mid Q \le i \le R \}.$$

Figure 1. System Configuration

To convert this instruction set $G = G' \cup G''$ into a memory-oriented set
of operations, we define

$$h_i^k = \{ (g_i, g_x, g_y, \ldots, g_z, G') \mid g_i \in G',\; g_x, g_y, \ldots, g_z \in G'' \}.$$

That is, $h_i^k$ is an n-tuple consisting of the $i$-th processor instruction
which requires a memory cycle, followed by some number of non-memory accessing
instructions, and terminated by any instruction which does require a memory
access. The superscript k is simply a member of the index set which allows us
to record all possible sequences of the instruction $g_i$ followed by some number
of non-memory reference instructions before the next instruction requiring a
memory access.

Associated with each $h_i^k$ there is an execution time

$$t_i^k = r_i + r_x + r_y + \cdots + r_z$$

(note that it does not include the time to execute the terminating,
memory-referencing instruction). Each $h_i^k$ occurs with a certain probability,
$\Pr[h_i^k]$, in an instruction stream.

We now define

$$h_i = \bigcup_k h_i^k .$$

The set

$$H = \{ h_i \}$$

is then a memory-oriented pseudo-instruction set for the processor. Associated
with each pseudo-instruction, $h_i$, is a probability

$$\Pr[h_i] = \sum_k \Pr[h_i^k],$$

and an average execution time

$$t_i = \sum_k \Pr[h_i^k]\, t_i^k .$$

The average duration of a pseudo-instruction is the expectation of $t_i$,

$$E[T] = \sum_{i=0}^{Q-1} \Pr[h_i]\, t_i .$$

A number of questions arise:
What is the probability that a processor's memory access will be delayed
because of input-output traffic, as a function of that traffic density?
Further, what is the average duration of such a delay, and how does the
performance of the processor affect the likelihood that the processor will be
delayed, either by itself or by an input-output transaction? In the following
section, the analysis will seek to answer these questions.
ANALYSIS


An  exact,  detailed  analysis  would  be  extremely  difficult,  requiring 
differing  conditional  relations  for  each  step  of  a  program  describing  the 
previous  history  of  the  program  and  its  environment  and  the  current  state  of 
all components of the system. The system is therefore examined on a statistical
basis assuming: a) stationary conditions, b) generation of memory
requests  by  the  input-output  controller  is  characterized  by  a  combination  of  a 
Poisson  process  and  a  high  speed  device  (such  as  a  swapping  disk)  generating 
requests  at  a  uniform  rate,  c)  if  the  processor  is  denied  a  memory  access 
during  execution  of  an  instruction,  it  will  wait  rather  than  proceed  with  any 
other  task. 

The input-output controller is assumed to be simultaneously regulating
and providing memory service for a large number of peripherals such as
tapes, printers, card equipment, and teletypes. Since memory requests arise
from a large number of independent sources, the generation of memory commands
by the input-output controller is considered to be partially governed by a
Poisson phenomenon with cumulative request rate $\lambda$. It is assumed that the
input-output controller also services a high speed device which generates
memory access requests with uniform rate $\gamma$ when actuated. This device has a
probability of being actuated (i.e., a duty cycle) of Pr [X].

Since there are N independent memory banks and addresses are interlaced
among them, it is reasonable to assume that the cumulative demands $\lambda$ and
$\gamma$ are evenly divided among all N memories. Since each memory cycle is of
length C, the input-output controller imposes a load

$$\rho = \frac{\lambda C}{N} + \frac{\Pr[X]\,\gamma C}{N}
      = \frac{(\lambda + \Pr[X]\,\gamma)\,C}{N}
      = \frac{\xi C}{N}, \qquad \text{where } \xi = \lambda + \Pr[X]\,\gamma .$$
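As a rough numerical illustration of the load expression, the sketch below evaluates ρ = (λ + Pr[X] γ) C / N. The function name `io_load` and the sample rates are hypothetical and chosen only to exercise the formula.

```python
def io_load(lam: float, gamma: float, duty_cycle: float,
            cycle_time: float, num_banks: int) -> float:
    """Per-bank input-output load rho = (lam + Pr[X] * gamma) * C / N."""
    xi = lam + duty_cycle * gamma      # combined request rate xi
    return xi * cycle_time / num_banks


# Hypothetical rates (requests per clock), C = 7 clocks, N = 2 banks.
rho = io_load(lam=0.01, gamma=0.1, duty_cycle=0.5, cycle_time=7, num_banks=2)
print(rho)
```

The analysis that follows requires ρ < 1, which holds comfortably for the sample values.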
Therefore, under stationary conditions with $\rho < 1$, the probability
that k out of N memory modules are not busy is

$$\Pr[k] = \binom{N}{k} (1-\rho)^k \rho^{N-k} .$$

We denote by "A" the event that the processor does not experience delay in
memory access due to input-output servicing. Assuming independence between
processor and input-output memory access addresses,

$$\Pr[A \mid k] = k/N .$$

Averaging over all values of k, we obtain

$$\Pr[A] = \sum_{k=0}^{N} \frac{k}{N} \binom{N}{k} (1-\rho)^k \rho^{N-k} .$$

Closing the series,

$$\Pr[A] = 1 - \rho .$$

This should not be a surprising result, since if $\rho$ is the probability that a
memory bank will be busy servicing an input-output controller request then
$1 - \rho$ should be the probability that the bank is available to the processor.
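The closed form can be checked numerically by summing the binomial series directly. This is a minimal sketch; the function name `prob_no_io_delay` is an illustrative choice.

```python
from math import comb


def prob_no_io_delay(rho: float, num_banks: int) -> float:
    """Sum over k of (k/N) * C(N,k) * (1-rho)^k * rho^(N-k)."""
    n = num_banks
    return sum(
        (k / n) * comb(n, k) * (1 - rho) ** k * rho ** (n - k)
        for k in range(n + 1)
    )


# The series should collapse to 1 - rho for any number of banks.
for rho in (0.0, 0.1, 0.5):
    for n in (1, 2, 8):
        assert abs(prob_no_io_delay(rho, n) - (1 - rho)) < 1e-12
```

The result is the mean of a binomial distribution with success probability 1 - ρ, divided by N.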

Even if the processor requests a memory cycle from a memory bank
which is not servicing an input-output controller command, it is still possible
for the servicing of the request to be delayed if the memory bank is still
busy servicing a previous processor request.

Let

$$S^k = \{ S_1^k, S_2^k, \ldots, S_{Q^k}^k \}$$

where $S_j^k$ is the k-tuple of memory-oriented pseudo-instructions
$(h_x, h_y, \ldots, h_z)$, $x, y, \ldots, z \in \{0, 1, 2, \ldots, Q-1\}$, and j is
the image of x, y, ..., z in a 1-1 mapping of k-tuples of non-negative integers
less than Q onto the positive integers not greater than $Q^k$.

$$T_j^k = t_x + t_y + \cdots + t_z$$

is the time required to execute the pseudo-instruction sequence $S_j^k$ if no
delay occurs. The expectation $E[T^k]$ is the average value of $T_j^k$,

$$E[T^k] = \sum_{j=1}^{Q^k} \Pr[S_j^k]\, T_j^k .$$

Now consider the execution of a sequence of k memory-oriented
pseudo-instructions. Let $\Pr[I_k']$ be the probability that the input-output
controller requests access to a particular memory bank due to the high speed
swapping device, and $\Pr[I_k'']$ the probability of an input-output generated
cycle in that same memory due to any of the other input-output devices. Then

$$\Pr[I_k'] = \sum_{j=1}^{Q^k} \Pr[S_j^k]\, \Pr[X]\, \min\!\left(1, \frac{\gamma T_j^k}{N}\right)$$

and

$$\Pr[I_k''] = \sum_{j=1}^{Q^k} \Pr[S_j^k] \left( 1 - \exp(-\lambda T_j^k / N) \right) .$$

These probabilities are of interest in the following work for small values of
k only; contemporary primary and secondary memory speeds result in

$$\frac{\gamma T_j^k}{N} < 1$$

for small k. Therefore,

$$\Pr[I_k'] = \Pr[X] \sum_{j=1}^{Q^k} \Pr[S_j^k]\, \frac{\gamma}{N}\, T_j^k
           = \Pr[X]\, \frac{\gamma}{N}\, E[T^k] .$$

The cumulative request rate $\lambda$ is quite small also. Typically, each
word transferred to a computer's memory by an input-output device (other than
a swapping disk which has been considered separately) will require several
processor cycles of manipulation. Therefore, $\lambda \ll 1$ and for small k the
approximation

$$1 - \exp(-\lambda T_j^k / N) \approx \lambda T_j^k / N$$

can be used. Then

$$\Pr[I_k''] = \sum_{j=1}^{Q^k} \Pr[S_j^k]\, (\lambda / N)\, T_j^k
            = (\lambda / N)\, E[T^k] .$$

Let $K_k$ denote the event that no input-output controller memory access
occurs in a particular memory bank during the execution of k pseudo-instructions
by the processor. Then

$$\Pr[K_k] = 1 - \Pr[I_k'] - \Pr[I_k''] .$$

Again for small values of k which are of interest,

$$\Pr[K_k] = 1 - \Pr[X]\, (\gamma / N)\, E[T^k] - (\lambda / N)\, E[T^k]
          = 1 - (E[T^k] / N)\, (\lambda + \gamma \Pr[X]) .$$

Using the previously defined $\xi = \lambda + \gamma \Pr[X]$,

$$\Pr[K_k] = 1 - (\xi / N)\, E[T^k] .$$

The probability of occurrence of one of the Q pseudo-operation codes in, e.g.,
the $i$-th position of an instruction sequence of length k is certainly not
uncorrelated with the occurrence of certain operations in the other k-1
positions. For example, a store operation is almost certainly followed by a load
and was (particularly for compiler generated code) likely preceded by a series
of arithmetic operations. Nevertheless, it is a reasonable first-order
approximation to assume the occurrence of operation codes in various positions
of the sequence to be uncorrelated with the occurrence of instructions in
other positions. Then

$$E[T^k] = k\, E[T]$$

and

$$\Pr[K_k] = 1 - \frac{k \xi}{N}\, E[T] .$$

If the processor is delayed because the $i$-th preceding pseudo-instruction
required access to the same memory bank, let this be denoted by $B_i$.
For example

$$\Pr[B_1] = \frac{1}{N}\, \Pr[A]\, \Pr[L_1]\, \Pr[K_1],$$

where $\Pr[L_1]$ is the probability that the previous pseudo-instruction was of
length less than C. Similarly

$$\Pr[B_2] = \frac{1}{N}\, \Pr[A]\, \Pr[L_2] \left( \Pr[A]\, \frac{N-1}{N} \right) \Pr[K_2]$$

and in general

$$\Pr[B_k] = \frac{1}{N}\, \Pr[A]\, \Pr[L_k] \left( \Pr[A]\, \frac{N-1}{N} \right)^{k-1} \Pr[K_k] .$$

Thus the probability that the processor is delayed by any previous processor
memory access, denoted by the event B, is

$$\Pr[B] = \frac{\Pr[A]}{N} \sum_{k=1}^{\infty} \Pr[L_k]\, \Pr[K_k] \left( \Pr[A]\, \frac{N-1}{N} \right)^{k-1}
        = \frac{1}{N-1} \sum_{k=1}^{\infty} \Pr[L_k]\, \Pr[K_k] \left( \Pr[A]\, \frac{N-1}{N} \right)^{k} .$$


The evaluation of $\Pr[L_k]$ is in general a difficult task since the probability
curve has the appearance shown in Figure 2 ($T_l$ is the execution time of the
longest pseudo-instruction, $T_s$ the execution time of the shortest). One
simplifying assumption is that $\Pr[L_k]$ in the region between $k_1$ and $k_2$
can be expressed as a simple linearly decreasing function

$$\Pr[L_k] = (k_2 - k) / (k_2 - k_1), \qquad k_1 < k < k_2,$$

perhaps with judicious juggling of the values of $k_1$ and $k_2$ if the instruction
set includes rarely used, extremely short or long operation codes. With such
an approximation, the series can be closed. The form of the closed expression
is so complex, however, as to obscure simple observation of the effects of
$\xi$, N, and processor speed on the likelihood of processor delay.

We note that a computer's instruction set invariably includes several
instructions which require time considerably in excess of one memory cycle
for execution. In that case $k_1 = 1$ and the $\Pr[L_k]$ curve may appear as in
Figure 3. We therefore make the alternative simplifying assumption that
$\Pr[L_k]$ can be approximated by the exponential

$$\Pr[L_k] = \alpha_1 \exp(-\alpha_2 k)$$

where $\alpha_1$ and $\alpha_2$ are constants adjusted to best fit the actual
$\Pr[L_k]$ curve.

Figure 2. Generalized Pr[L_k] Curve

Figure 3. Typical Pr[L_k] Curve With Exponential Approximation

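Substituting for $\Pr[L_k]$ and $\Pr[K_k]$ in the expression for $\Pr[B]$,

$$\Pr[B] = \frac{1}{N-1} \sum_{k=1}^{\infty} \alpha_1 \exp(-\alpha_2 k)
           \left( 1 - \frac{k \xi}{N}\, E[T] \right)
           \left( \Pr[A]\, \frac{N-1}{N} \right)^{k} .$$

Again, for small values of $\xi$,

$$1 - \frac{k \xi}{N}\, E[T] \approx \exp(-k \xi E[T] / N)$$

and

$$\Pr[B] = \frac{\alpha_1}{N-1} \sum_{k=1}^{\infty}
           \left( \exp(-\alpha_2 - \xi E[T]/N)\, \Pr[A]\, \frac{N-1}{N} \right)^{k} .$$

Define

$$\beta = \exp(-\alpha_2 - \xi E[T]/N)\, \Pr[A]\, (N-1)/N .$$

Then

$$\Pr[B] = \frac{\alpha_1}{N-1} \sum_{k=1}^{\infty} \beta^k$$

and closing the series,

$$\Pr[B] = \frac{\alpha_1\, \beta}{(N-1)(1-\beta)}
        = \frac{(1-\rho)\, (\alpha_1 / N)\, e^{-\alpha_2}\, e^{-\xi E[T]/N}}
               {1 - (1-\rho) \left( \frac{N-1}{N} \right) e^{-\alpha_2}\, e^{-\xi E[T]/N}} .$$

We note that for N = 1, this simply reduces to the expression for $\Pr[B_1]$,

$$\Pr[B_1] = (1-\rho)\, \alpha_1\, e^{-\alpha_2}\, e^{-\xi E[T]/N},$$

which is simply the probability that the previous pseudo-instruction was less
than C in length, diminished by the probability that an input-output transaction
has occurred or is pending.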
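The closing of the series for Pr[B] can be checked numerically. The sketch below sums the terms Pr[B_k] = (1/N) Pr[A] Pr[L_k] Pr[K_k] (Pr[A] (N-1)/N)^(k-1), using Pr[A] = 1 - ρ, the exponential form Pr[L_k] = α_1 exp(-α_2 k), and the small-ξ form Pr[K_k] = exp(-k ξ E[T]/N), and compares the sum with the closed expression. The function names and the sample parameter values are illustrative assumptions.

```python
from math import exp


def prob_self_delay_series(rho, xi, mean_t, n, a1, a2, terms=200):
    """Truncated sum of Pr[B_k] over k = 1..terms."""
    x = (1 - rho) * (n - 1) / n            # Pr[A] * (N-1)/N
    total = 0.0
    for k in range(1, terms + 1):
        p_l = a1 * exp(-a2 * k)            # exponential fit to Pr[L_k]
        p_k = exp(-k * xi * mean_t / n)    # small-xi form of Pr[K_k]
        total += (1 / n) * (1 - rho) * p_l * p_k * x ** (k - 1)
    return total


def prob_self_delay_closed(rho, xi, mean_t, n, a1, a2):
    """Closed form of the same geometric series."""
    e = exp(-a2 - xi * mean_t / n)
    return (1 - rho) * (a1 / n) * e / (1 - (1 - rho) * ((n - 1) / n) * e)


# Hypothetical parameters; the two evaluations should agree for any N >= 1.
for n in (1, 2, 4):
    s = prob_self_delay_series(0.1, 0.02, 7.2, n, 1.0, 0.6)
    c = prob_self_delay_closed(0.1, 0.02, 7.2, n, 1.0, 0.6)
    assert abs(s - c) < 1e-12
```

Writing the closed form with the factor 1/N rather than 1/(N-1) keeps it well defined at N = 1, where only the k = 1 term survives.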

The probability that a processor is delayed due to any cause is

$$\Pr[D] = \Pr[A' \text{ or } B] = (1 - \Pr[A]) + \Pr[B]$$

where $\Pr[A'] = 1 - \Pr[A]$ is the probability that the processor is delayed
due to input-output servicing.

$$\Pr[D] = \rho + \frac{(1-\rho)\, (\alpha_1 / N)\, \exp(-\alpha_2)\, \exp(-\xi E[T]/N)}
                       {1 - (1-\rho) \left( \frac{N-1}{N} \right) \exp(-\alpha_2)\, \exp(-\xi E[T]/N)}$$

If event D occurs (i.e., the processor is delayed), it is natural to
consider the length of the delay. First consider the delay if the interference
is caused by the input-output controller. Let $W_c$ be the length of the
delay resulting from input-output controller access of a memory bank; define
$T_c$ to be the time remaining for completing an in-process memory cycle when the
processor request occurs, and let $E[n_1]$ denote the expected number of
input-output memory access requests queued for this memory. The mean value of
$W_c$ is then given by

$$E[W_c] = E[T_c] + E[n_1]\, C + \frac{\xi C}{N}\, E[W_c] .$$

The three terms of this expectation are the average amount of time needed to
finish an in-process command, the average time to complete the remainder of
the queued requests, and the average time required to service any new commands
which arrive while the other requests are being serviced (these new requests
will be serviced ahead of the processor's request).

To arrive at $E[W_c]$, $E[n_1]$ and $E[T_c]$ must be obtained. The average
length of the queue, $E[n_1]$, is related to the average time an arriving
request must wait in the queue, $E[W_1]$:

$$E[W_1] = E[T_c] + \frac{\xi C}{N}\, E[W_1] .$$

Here the first term is again the time required to complete an in-process
command, and the second is the time needed to empty the queue. From this,

$$E[W_1] = \frac{E[T_c]}{1 - \xi C / N} = \frac{E[T_c]}{1 - \rho} .$$

Since the system is in equilibrium, the average size of the queue [3] is simply
the number of additional requests which arrive while a request is waiting for
service,

$$E[n_1] = \frac{\xi}{N}\, E[W_1] = \frac{\xi\, E[T_c]}{N (1 - \rho)} .$$

Substituting this into the equation for $E[W_c]$ and solving,

$$E[W_c] = E[T_c] / (1 - \rho)^2 .$$

The arriving input-output memory access requests are uncorrelated
with respect to both other input-output requests and processor commands.
Therefore the probability of the arrival of a request when a memory bank is
busy may be assumed to be uniformly distributed over the memory cycle time C,
whence the expected waiting time for access is simply

$$E[T_c] = C/2 .$$

Thus

$$E[W_c] = \frac{C}{2 (1 - \rho)^2} .$$
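The chain of substitutions leading to the input-output wait can be traced step by step. This is a sketch under the relations used in this section, with ρ = ξC/N so that ξ/N = ρ/C; the function name `io_wait` is an illustrative choice.

```python
def io_wait(rho: float, cycle_time: float) -> float:
    """Mean processor wait E[W_c] caused by input-output traffic on one bank."""
    t_c = cycle_time / 2                   # E[T_c]: residual of the cycle in progress
    w_1 = t_c / (1 - rho)                  # E[W_1]: wait of an arriving I/O request
    n_1 = (rho / cycle_time) * w_1         # E[n_1] = (xi/N) * E[W_1]
    # Solve E[W_c] = E[T_c] + E[n_1] * C + rho * E[W_c] for E[W_c].
    return (t_c + n_1 * cycle_time) / (1 - rho)


# The step-by-step value should match C / (2 * (1 - rho)^2).
for rho in (0.0, 0.2, 0.5):
    assert abs(io_wait(rho, 7.0) - 7.0 / (2 * (1 - rho) ** 2)) < 1e-12
```

With no input-output load the wait reduces to half a memory cycle, as expected for a uniformly distributed arrival.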


Now consider the processor delay due to the $k$-th previous
pseudo-instruction being directed to the same memory bank. Let $W_p^k$ be the
duration of such a delay, and $E[T_p^k]$ be the average time to complete a
previously initiated memory cycle. To this completion time we add the delay due
to any input-output transactions arriving during the period:

$$E[W_p^k] = E[T_p^k] + (\xi / N) \left( C - E[T_p^k] + E[W_p^k] \right) C,$$

from which

$$E[W_p^k] = \frac{\rho C + (1 - \rho)\, E[T_p^k]}{1 - \rho}
          = \rho C / (1 - \rho) + E[T_p^k] .$$
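The solution of the self-interference wait equation can be confirmed by iterating the defining relation to its fixed point. In the sketch below (ξ/N) C is written as ρ; the function names and the sample values are illustrative assumptions.

```python
def processor_wait(rho: float, cycle_time: float, t_p: float) -> float:
    """Closed form: rho * C / (1 - rho) + E[T_p^k]."""
    return rho * cycle_time / (1 - rho) + t_p


def processor_wait_iterated(rho: float, cycle_time: float, t_p: float,
                            steps: int = 200) -> float:
    """Iterate W <- T_p + rho * (C - T_p + W) until it settles."""
    w = t_p
    for _ in range(steps):
        w = t_p + rho * (cycle_time - t_p + w)
    return w


# Hypothetical values: rho = 0.3, C = 7 clocks, E[T_p^k] = 5.5 clocks.
closed = processor_wait(0.3, 7.0, 5.5)
iterated = processor_wait_iterated(0.3, 7.0, 5.5)
assert abs(closed - iterated) < 1e-9
```

The iteration converges geometrically with ratio ρ, which is why the condition ρ < 1 is needed.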
Given that the processor is delayed, the expected value of the waiting time is
therefore

$$E[W \mid D] = \frac{\Pr[A']\, E[W_c] + \sum_{k=1}^{\infty} \Pr[B_k]\, E[W_p^k]}{\Pr[D]} .$$

The expected time to complete a memory cycle due to the $k$-th previous
pseudo-instruction is in general as intractable to analysis as the previous
quantities we have used. However, it is not unreasonable to assume that this
expectation is also exponentially related to k:

$$E[T_p^k] = C \exp(-\alpha_3 k) .$$

Unfortunately, $\alpha_3$ is not in general directly related to the $\alpha$'s used
in the approximation to $\Pr[L_k]$, and hence must be separately determined. We
define $\alpha_4 = \alpha_2 + \alpha_3$. Closing the series in the expression for
$E[W \mid D]$,

$$\sum_{k=1}^{\infty} \Pr[B_k] \left( \frac{\rho C}{1-\rho} + E[T_p^k] \right)
  = \frac{\rho C \alpha_1}{(1-\rho)(N-1)} \sum_{k=1}^{\infty}
      \left( \exp(-\alpha_2 - \xi E[T]/N)\, \Pr[A]\, \frac{N-1}{N} \right)^{k}
  + \frac{C \alpha_1}{N-1} \sum_{k=1}^{\infty}
      \left( \exp(-\alpha_4 - \xi E[T]/N)\, \Pr[A]\, \frac{N-1}{N} \right)^{k}$$

$$= \frac{(\rho C \alpha_1 / N)\, \exp(-\alpha_2 - \xi E[T]/N)}
         {1 - (1-\rho) \left( \frac{N-1}{N} \right) \exp(-\alpha_2 - \xi E[T]/N)}
  + \frac{(1-\rho)\, (C \alpha_1 / N)\, \exp(-\alpha_4 - \xi E[T]/N)}
         {1 - (1-\rho) \left( \frac{N-1}{N} \right) \exp(-\alpha_4 - \xi E[T]/N)} .$$

The total expected time to execute an M + 1 step sequence of instruction
fetch and execution can now be derived. Recalling that only M' + 1 of
these actions require memory access, the probability of j delays in the
(M' + 1)-cycle sequence is

$$\Pr[j] = \binom{M'+1}{j} (\Pr[D])^{j} (1 - \Pr[D])^{M'+1-j} .$$

The expected time to execute the M' + 1 cycle sequence when j delays occur is

$$E[S \mid j] = (M'+1)\, E[T] + j\, E[W \mid D] .$$

Hence the expected time to execute the sequence is

$$E[S] = \sum_{j=0}^{M'+1} \Pr[j]\, E[S \mid j]$$
$$= \sum_{j=0}^{M'+1} \binom{M'+1}{j} \Pr[D]^{j} (1 - \Pr[D])^{M'+1-j}
     \left( (M'+1)\, E[T] + j\, E[W \mid D] \right)$$
$$= (M'+1)\, E[T] + E[W \mid D] \sum_{j=0}^{M'+1} j \binom{M'+1}{j} \Pr[D]^{j} (1 - \Pr[D])^{M'+1-j}$$
$$= (M'+1)\, E[T] + E[W \mid D]\, (M'+1)\, \Pr[D] \sum_{j=0}^{M'} \binom{M'}{j} \Pr[D]^{j} (1 - \Pr[D])^{M'-j}$$
$$= (M'+1) \left( E[T] + E[W \mid D]\, \Pr[D] \right) .$$

This is again an intuitive result: the expected delay should simply be the
expected delay per cycle times the number of cycles.
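The reduction of the binomial sum can be confirmed numerically. The sketch below weights the time of a sequence with j delays by the binomial probability of j delays and compares the total with (M'+1)(E[T] + E[W | D] Pr[D]). The function name and the sample values are illustrative assumptions.

```python
from math import comb


def expected_sequence_time(m_prime: int, mean_t: float,
                           p_delay: float, mean_wait: float) -> float:
    """Sum over j of Pr[j] * ((M'+1) E[T] + j E[W|D])."""
    cycles = m_prime + 1
    total = 0.0
    for j in range(cycles + 1):
        p_j = comb(cycles, j) * p_delay ** j * (1 - p_delay) ** (cycles - j)
        total += p_j * (cycles * mean_t + j * mean_wait)
    return total


# Hypothetical inputs: M' = 4, E[T] = 7.2, Pr[D] = 0.55, E[W|D] = 5.5.
direct = expected_sequence_time(4, 7.2, 0.55, 5.5)
closed = (4 + 1) * (7.2 + 0.55 * 5.5)
assert abs(direct - closed) < 1e-9
```

The agreement follows from the mean of a binomial distribution, (M'+1) Pr[D].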
Expanding this result and substituting,

$$E[S] = (M'+1) \left( E[T] + (1 - \Pr[A])\, E[W_c] + \sum_{k=1}^{\infty} \Pr[B_k]\, E[W_p^k] \right)$$

$$= (M'+1) \left( E[T] + \frac{\rho C}{2 (1-\rho)^2}
   + \frac{(\rho C \alpha_1 / N)\, \exp(-\alpha_2 - \xi E[T]/N)}
          {1 - (1-\rho) \left( \frac{N-1}{N} \right) \exp(-\alpha_2 - \xi E[T]/N)}
   + \frac{(1-\rho)\, (C \alpha_1 / N)\, \exp(-\alpha_4 - \xi E[T]/N)}
          {1 - (1-\rho) \left( \frac{N-1}{N} \right) \exp(-\alpha_4 - \xi E[T]/N)} \right) .$$

In the case where input-output activity is absent (i.e., $\xi = \rho = 0$), the above
expression reduces to

$$E[S] = (M'+1) \left( E[T] + \frac{(C \alpha_1 / N)\, \exp(-\alpha_4)}
                                  {1 - \left( \frac{N-1}{N} \right) \exp(-\alpha_4)} \right) .$$

Barring a transfer of control during the sequence of instructions,
the average execution time per instruction, for a block of M instructions, is
then E [t] = E [S]/M.
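The full expression for E[S] is easier to explore as a function. The sketch below is one possible transcription, assuming the form (M'+1)(E[T] + ρC/(2(1-ρ)²) + two self-interference terms) with α_4 = α_2 + α_3; the function name, argument names, and the sample values (no input-output load, one bank, exp(-α_4) = 0.43, C = 7, E[T] = 7.2, M' = 4, M = 8) are assumptions chosen for illustration.

```python
from math import exp, log


def sequence_time(m_prime, mean_t, cycle, n, rho, xi, a1, a2, a4):
    """Expected time E[S] of one instruction block, with or without I/O load."""
    def term(alpha, weight):
        e = exp(-alpha - xi * mean_t / n)
        return weight * (cycle * a1 / n) * e / (1 - (1 - rho) * ((n - 1) / n) * e)

    io_term = rho * cycle / (2 * (1 - rho) ** 2)   # (1 - Pr[A]) * E[W_c]
    return (m_prime + 1) * (mean_t + io_term + term(a2, rho) + term(a4, 1 - rho))


# With rho = xi = 0 only the alpha_4 term remains.
a4 = -log(0.43)
block = sequence_time(m_prime=4, mean_t=7.2, cycle=7.0, n=1,
                      rho=0.0, xi=0.0, a1=1.0, a2=-log(0.55), a4=a4)
per_instruction = block / 8                        # E[t] = E[S] / M
print(round(block, 2), round(per_instruction, 2))
```

Setting `rho` and `xi` to small positive values shows how quickly input-output traffic adds to the block time.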
EXAMPLE

The ILLIAC IV Processing Unit [1, 2] will be used to demonstrate
results from the foregoing analysis. Table 1 shows a breakdown of instruction
types used in several ILLIAC IV codes (timing information in this and
subsequent work is given in clock periods). Table 2 shows these instructions
partitioned into the previously defined sets G' and G".

ILLIAC IV fetches 16 instructions simultaneously. It is assumed
that half of these are Advast instructions which do not enter into the
calculations. Hence M = 8 for this analysis and $\Pr[g_0]$ = 11%. The probabilities
of the other instructions have been reduced by one ninth to include the fetch
cycle. Since memory reference and non-memory reference instructions are
evenly divided, M' = 4.

It is assumed that the Final Instruction Queue (Finq) is kept sufficiently
full that the only delays are due to PE Memory conflicts. An excess
of Advast instructions or such operations as LOAD and BIN are excluded in this
analysis; their inclusion can only further delay the PE. The time to execute
any pseudo-instruction, $h_i$, is given by

$$t_i = r_i + 5.5 \sum_{k=0}^{\infty} k\, (0.57)\, (0.43)^k = r_i + 4.1 .$$
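The constant 4.1 can be reproduced with a few lines. The sketch below assumes the reading that a memory-referencing instruction is followed by k non-memory instructions with probability (0.43)^k (0.57), each averaging 5.5 clocks; the function name is an illustrative choice.

```python
def mean_non_memory_time(p_non_memory: float, mean_clocks: float,
                         terms: int = 200) -> float:
    """mean_clocks * sum over k of k * p^k * (1 - p), truncated after `terms`."""
    p_memory = 1 - p_non_memory
    return mean_clocks * sum(
        k * p_non_memory ** k * p_memory for k in range(terms)
    )


tail = mean_non_memory_time(0.43, 5.5)
# The series has the closed form p / (1 - p), so tail is 5.5 * 0.43 / 0.57.
assert abs(tail - 5.5 * 0.43 / 0.57) < 1e-9
print(round(tail, 1))
```

The value 5.5 is consistent with the probability-weighted mean of the non-memory clock counts in Table 2 (2.36 / 0.43).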

The set, H, of memory-oriented pseudo-instructions, along with their
probabilities of occurrence and average execution times, is shown in Table 3.
Assuming perfect memory overlap (i.e., memory access delays the PE only
because of closely following sequential memory requests, rather than a lack of
overlap access paths), an exact calculation of $\Pr[B_1]$ (with $\xi = \rho = 0$) and
$E[T_p]$ is given in Table 4. From these, using C = 7 clocks, $\alpha_1 = 1$,
$\exp(-\alpha_2) = 0.55$, $\exp(-\alpha_3) = 0.78$, and $\exp(-\alpha_4) = 0.43$ are
calculated.

The expected duration of one block of instructions (again under the
assumption of no input-output activity) is

$$E[S] = (4 + 1)\, (7.2 + 3.0) = 51 \text{ clocks}$$
                                              Number of Clocks
                                              Assuming Perfect
PE Operation                   Probability    Overlap

Add/Subtract
   from memory                     10%              7
   from registers                   6%              7
Multiply
   from memory                      9%              9
   from registers                   5%              9
Divide                              2%             56
Load                               18%              1
Register to register               18%              1
Route                              10%              4
Mode Register Comparison
   and Bit Operations               9%              1
Store                              13%              1

Table 1.   ILLIAC IV PE Instruction Usage


Memory Instructions (G')                 Non-Memory Instructions (G")

                       Number of                                Number of
       Probability     Clocks                   Probability     Clocks

g0         11%            1              g5          4%            7
g1         10%            7              g6          4%            9
g2          8%            9              g7          2%           56
g3         16%            1              g8         16%            1
g4         12%            1              g9          9%            4
                                         g10         8%            1

Totals     57%                                      43%

Table 2.   Instruction Sets


Pseudo-instruction     Probability (%)     Number of Clocks

h0                          19                    5
h1                          18                   11
h2                          14                   13
h3                          28                    5
h4                          21                    5

Mean execution time E[T] = 7.2 clocks

Table 3.   Memory-oriented Pseudo-instruction Set
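The entries of Table 3 can be rederived from the memory-instruction column of Table 2. This sketch assumes that each probability is renormalized by the 57% total, that each clock count is the instruction time plus the 4.1-clock mean tail of non-memory instructions, and that both are rounded to integers as tabulated; the variable names are illustrative.

```python
# (probability in percent, clocks) for the memory instructions g0..g4 of Table 2
memory_instructions = [(11, 1), (10, 7), (8, 9), (16, 1), (12, 1)]
TAIL_CLOCKS = 4.1  # mean time of the trailing non-memory instructions

total = sum(p for p, _ in memory_instructions)  # 57 percent of the stream
probs = [round(100 * p / total) for p, _ in memory_instructions]
clocks = [round(r + TAIL_CLOCKS) for _, r in memory_instructions]

# Mean pseudo-instruction time E[T] from the rounded table entries.
mean_t = sum(p * c for p, c in zip(probs, clocks)) / 100
print(probs, clocks, mean_t)
```

Using the rounded entries reproduces the tabulated mean of 7.2 clocks; carrying the unrounded values gives a slightly larger figure.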
B Events              Probability (%)     Time to Complete

gx G'                      39.0                  6
gx g9 G'                    3.5                  2
gx gy G'                    9.4                  5
gx gy^2 G'                  2.2                  4
gx gy^3 G'                  0.5                  3

Pr [B1]                    54.6

Note: 


gx=  (g^Ug^),   g  =  (ggU  giQ) 


Mean wait if delayed = E[T_p] = 5.5 clocks

Mean average delay   = E[T_p] Pr[B_1] = 3.0 clocks


Table 4.   Probability of Processor Self-interference and Expected Delay
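The summary figures of Table 4 follow from its rows by simple weighting. The sketch below recomputes them from the tabulated probabilities (in percent) and completion times. Equating exp(-α_2) with Pr[B_1] (for α_1 = 1) and exp(-α_3) with the mean wait divided by C is an inference from the single-bank, no-input-output formulas and should be read as an assumption.

```python
CYCLE = 7  # memory cycle time C in clocks

# (probability in percent, clocks left to complete) for each B event row
rows = [(39.0, 6), (3.5, 2), (9.4, 5), (2.2, 4), (0.5, 3)]

p_b1 = sum(p for p, _ in rows) / 100                         # Pr[B_1]
mean_wait = sum(p * t for p, t in rows) / sum(p for p, _ in rows)
mean_delay = mean_wait * p_b1                                # delay per memory cycle

exp_a2 = p_b1                  # assumes alpha_1 = 1
exp_a3 = mean_wait / CYCLE     # from E[T_p^1] = C * exp(-alpha_3)
exp_a4 = exp_a2 * exp_a3       # alpha_4 = alpha_2 + alpha_3
print(round(mean_wait, 1), round(mean_delay, 1), round(exp_a4, 2))
```

The three exponentials come out near 0.55, 0.78 and 0.43, the constants quoted in the example.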
This corresponds to an average instruction execution rate (over the set G) of

E [t] = 6.4 clocks per instruction

as opposed to the ideal execution time

E [r] = 4.5 clocks per instruction.

Thus it is seen that the ILLIAC IV Processing Element efficiency is degraded
by more than 40% (6.4 versus 4.5 clocks per instruction) due to the memory
interference. Note that about one quarter of this is due to instruction
fetching. Small kernels of code which can be entirely contained in the
instruction look-ahead store are clearly advantageous.

Under the assumption that the exponential approximations used in the
analysis are valid for the PE instruction distribution, the expected instruction
block execution time for two independent memory banks per PE becomes

$$E[S'] = (4 + 1)\, (7.2 + 1.5 / (1.0 - 0.21)) = 45 \text{ clocks.}$$

The average instruction execution rate would then be

E [t'] = 5.6 clocks per instruction,

with memory conflicts causing a 24% reduction in processor efficiency.
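The closing arithmetic can be retraced in a few lines. The sketch below takes the quoted figures as given (8 instructions per block, ideal time 4.5 clocks per instruction) and expresses the degradation as the relative increase over the ideal time; the function names are illustrative choices.

```python
def per_instruction(block_clocks: float, instructions: int = 8) -> float:
    """Average clocks per instruction for one fetched block."""
    return block_clocks / instructions


def degradation(actual: float, ideal: float = 4.5) -> float:
    """Relative increase of the actual time over the ideal time."""
    return actual / ideal - 1


# Two-bank block time, using the rounded intermediate values of the text.
two_bank_block = (4 + 1) * (7.2 + 1.5 / (1.0 - 0.21))
print(two_bank_block, per_instruction(45), degradation(5.6), degradation(6.4))
```

With one bank the same measure gives about 42%, so doubling the number of banks removes a little under half of the memory-conflict penalty in this example.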